True, if the response variable has correlation 0 with all of the predictor variables, the fitted model reduces to the intercept alone, with every slope equal to 0. In the simple linear regression case, this is a horizontal line through the mean of the data.
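A small hypothetical illustration of this point (the data here are made up so that the sample correlation is exactly 0):

```r
# A response with exactly zero sample correlation with the predictor
# yields a fitted slope of 0, so the fitted line is horizontal at mean(y).
x <- c(-1, 0, 1)
y <- c(1, 0, 1)
cor(x, y)        # 0
coef(lm(y ~ x))  # intercept = mean(y), slope = 0
```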
True, even if predictor variables are perfectly correlated, the model can still fit the data well. Thinking geometrically, \(\dim(\operatorname{col}(X)) < p\) in the case of multicollinearity; however, since the column space still exists, the projection of \(Y\) onto it, and hence the fitted values, are still well defined.
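A hypothetical sketch of this on simulated data: `x2` is an exact multiple of `x1`, so the design matrix is rank-deficient, but the fitted values are unchanged (R drops the redundant column and reports an `NA` coefficient for it):

```r
# x2 is perfectly collinear with x1, so rank(X) < p, but the fitted
# values (the projection of y onto the column space) are unchanged.
set.seed(1)
x1 <- rnorm(20)
x2 <- 2 * x1                      # perfectly correlated with x1
y  <- 1 + 3 * x1 + rnorm(20)
fit_both <- lm(y ~ x1 + x2)       # x2 gets an NA coefficient
fit_one  <- lm(y ~ x1)
all.equal(fitted(fit_both), fitted(fit_one))  # TRUE
```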
True, since \(\hat{Y} = \hat{Y}^*\), the sums of squares remain the same, so the coefficient of multiple determination is unaffected.
True, since the \(p-1\) individual t-tests are not equivalent to testing whether there is a regression relation between \(Y\) and the full set of \(X\) variables (as the F test does), the two can disagree. This happens in particular under multicollinearity.
True; say, for example, we have a variable \(X_1\) and add \(X_2\), which is perfectly correlated with \(X_1\). The SSR remains unchanged, but the MSR becomes smaller because the regression degrees of freedom increase while SSR does not. This can produce individually significant t-tests for each coefficient alongside an insignificant overall model.
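A hypothetical demonstration on simulated data (a near-duplicate is used instead of an exact duplicate so that R keeps both columns):

```r
# Adding a near-copy of x1 leaves SSR essentially unchanged, but the
# regression df increases, so MSR = SSR/df and the overall F statistic
# both decrease.
set.seed(2)
x1 <- rnorm(30)
x2 <- x1 + rnorm(30, sd = 1e-3)   # almost perfectly correlated with x1
y  <- 2 + x1 + rnorm(30)
ssr <- function(fit) sum((fitted(fit) - mean(y))^2)
fit1 <- lm(y ~ x1)
fit2 <- lm(y ~ x1 + x2)
c(ssr(fit1), ssr(fit2))           # nearly identical SSR
c(summary(fit1)$fstatistic[1],
  summary(fit2)$fstatistic[1])    # F drops with the extra df
```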
True, since both the error variance and the variance of the corresponding least squares estimator would be \(\sigma^2\).
True, \(\hat{\beta}^*_k\) is equal to \(\frac{s_y}{s_{x_k}} r_{yx_k}\), which is not influenced by correlations with the other \(X\) variables. It measures how much of the variation in \(Y\) the variation in \(X_k\) can explain.
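A hypothetical numeric check of the simple-regression identities behind this answer: the slope of the standardized regression equals \(r_{yx}\), and the raw slope equals \(\frac{s_y}{s_x} r_{yx}\) (simulated data, names are illustrative):

```r
# Standardized slope = correlation; raw slope = (s_y / s_x) * r.
set.seed(3)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))
b_std <- coef(lm(zy ~ zx))[2]
b_raw <- coef(lm(y ~ x))[2]
all.equal(unname(b_std), cor(x, y))                    # TRUE
all.equal(unname(b_raw), (sd(y) / sd(x)) * cor(x, y))  # TRUE
```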
True, since the inflated variance of \(\hat{\beta}^*_k\) is caused by the intercorrelation between \(X_k\) and the rest of the \(X\) variables.
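A hypothetical sketch of how that inflation is usually quantified, via the variance inflation factor \(VIF_k = 1/(1 - R_k^2)\), where \(R_k^2\) comes from regressing \(X_k\) on the other predictors (simulated data):

```r
# Strong intercorrelation between x1 and x2 inflates Var(beta_hat):
# VIF_k = 1 / (1 - R_k^2), with R_k^2 from regressing X_k on the rest.
set.seed(4)
x1 <- rnorm(40)
x2 <- 0.95 * x1 + rnorm(40, sd = 0.3)  # highly correlated with x1
r2 <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2)
vif_x1                                 # well above 1
```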
True, we can have a large number of variables that are uncorrelated with each other, each individually significant, while the model as a whole produces a non-significant p-value.
library(ggplot2)
library(GGally)
library(plotly)
property <- read.table("property.txt")
colnames(property) <-
c("Ren.Rate", "Age", "Exp", "Vac.Rate", "Sq.Foot")
property
ggplotly(ggplot(data = property, aes(x = Age, y = Ren.Rate)) + geom_point())
We can see from the plot that there is no clear sign of a linear relationship between the age of a property and its rental rate.
We have the model equation: \[ Y_i = \beta_0 + \beta_1\tilde{X}_{i1} + \beta_{11}\tilde{X}_{i1}^2 + \beta_2X_{i2} + \beta_4X_{i4} + \varepsilon_i \] where \(\tilde{X}_{i1} = X_{i1} - \bar{X}_1\) is the centered age. NOTE: we could also fit the uncentered \(X_{i1}\) in place of \(\tilde{X}_{i1}\).
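A hypothetical illustration of why the age variable is centered before squaring: for a positive-valued predictor, \(X\) and \(X^2\) are strongly correlated, while the centered term and its square are much less so (simulated ages, not the `property` data):

```r
# Centering before squaring reduces the correlation between the
# linear and quadratic terms, easing multicollinearity.
set.seed(5)
age   <- runif(50, 0, 20)
age_c <- age - mean(age)
c(raw = cor(age, age^2), centered = cor(age_c, age_c^2))
```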
property["AgeCent"] <- property$Age - mean(property$Age)
property["AgeSq"] <- property$AgeCent ^ 2
polyModel <-
lm(Ren.Rate ~ AgeCent + AgeSq + Exp + Sq.Foot, data = property)
summary(polyModel)
##
## Call:
## lm(formula = Ren.Rate ~ AgeCent + AgeSq + Exp + Sq.Foot, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89596 -0.62547 -0.08907 0.62793 2.68309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.019e+01 6.709e-01 15.188 < 2e-16 ***
## AgeCent -1.818e-01 2.551e-02 -7.125 5.10e-10 ***
## AgeSq 1.415e-02 5.821e-03 2.431 0.0174 *
## Exp 3.140e-01 5.880e-02 5.340 9.33e-07 ***
## Sq.Foot 8.046e-06 1.267e-06 6.351 1.42e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.097 on 76 degrees of freedom
## Multiple R-squared: 0.6131, Adjusted R-squared: 0.5927
## F-statistic: 30.1 on 4 and 76 DF, p-value: 5.203e-15
# Plotting Observations Against Fitted Values
ggplotly(
  ggplot() + aes(x = polyModel$fitted.values, y = property$Ren.Rate) +
    geom_point() +
    labs(x = "Fitted Values", y = "Observations", title = "Observations against Fitted Values")
)
We have the estimated regression function: \[ \hat{Y}_i = 10.19 - 0.182\tilde{X}_{i1} + 0.014\tilde{X}_{i1}^2 + 0.314X_{i2} + 0.000008X_{i4} \] We find that our model is a good fit: we have a relatively good \(R^2_{adj}\), and the Observations against Fitted Values plot is fairly linear.
# Model 2
model2 <- lm(Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
summary(model2)
##
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0620 -0.6437 -0.1013 0.5672 2.9583
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.237e+01 4.928e-01 25.100 < 2e-16 ***
## Age -1.442e-01 2.092e-02 -6.891 1.33e-09 ***
## Exp 2.672e-01 5.729e-02 4.663 1.29e-05 ***
## Sq.Foot 8.178e-06 1.305e-06 6.265 1.97e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.132 on 77 degrees of freedom
## Multiple R-squared: 0.583, Adjusted R-squared: 0.5667
## F-statistic: 35.88 on 3 and 77 DF, p-value: 1.295e-14
We find that both the \(R^2\) and \(R^2_{adj}\) are higher in the quadratic model than in Model 2. The \(R^2\) is \(0.583\) for Model 2 and \(0.6131\) for the quadratic model; the \(R^2_{adj}\) is \(0.5667\) for Model 2 and \(0.5927\) for the quadratic model. This leads us to conclude that the quadratic model is a better fit than Model 2.
To test our full model versus our reduced model, we have: \[ H_0: \beta_j = 0\ \text{for all} \ j\in \mathbf J \qquad H_a: \beta_j \neq 0 \ \text{for some} \ j\in \mathbf J \] With test statistic and null distribution: \[ F^* = \frac{\frac{SSE(R)-SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}, \qquad F^* \sim F_{(df_R - df_F,\ df_F)} \]
We reject \(H_0\) if \(F^* > F_{(1-\alpha;\ df_R - df_F,\ df_F)}\).
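As a hypothetical sanity check of this general linear test on simulated data (not the `property` data), \(F^*\) computed by hand from the SSEs matches what `anova()` reports for the nested pair:

```r
# Compute F* from SSEs and residual df of full and reduced models,
# then compare with the F statistic reported by anova().
set.seed(6)
x <- rnorm(30)
y <- 1 + x + 0.5 * x^2 + rnorm(30)
full    <- lm(y ~ x + I(x^2))
reduced <- lm(y ~ x)
sse  <- function(fit) sum(resid(fit)^2)
df_f <- full$df.residual
df_r <- reduced$df.residual
f_star <- ((sse(reduced) - sse(full)) / (df_r - df_f)) / (sse(full) / df_f)
all.equal(f_star, anova(reduced, full)$F[2])  # TRUE
```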
# Find critical value
qf(1 - 0.05, 77-76, 77)
## [1] 3.965094
anova(polyModel, model2)
Given that our observed \(F^* = 5.9078\) exceeds the critical value \(3.965\), we reject \(H_0\) and conclude that the quadratic term is significant in the model at \(\alpha = 0.05\).
# Our prediction for Model 2
predict(model2, data.frame(Age = 4, Exp = 10, Sq.Foot = 80000), interval = "prediction", level = 0.99)
## fit lwr upr
## 1 15.11985 12.09134 18.14836
# Our prediction for quadratic model
predict(polyModel, data.frame(AgeCent = 4, AgeSq = 16, Exp = 10, Sq.Foot = 80000), interval = "prediction", level = 0.99)
## fit lwr upr
## 1 13.47259 10.47873 16.46645
We can see that the quadratic model yields a lower prediction interval than Model 2. Note, however, that the two predictions describe the same property only if AgeCent is set to \(4 - \overline{Age}\); as written, AgeCent = 4 corresponds to a property four years older than the sample mean age, not a four-year-old property.